Abstract. Starting from the mutual information, we present a method to find a
hamiltonian for a fully connected neural network model with an arbitrary, finite
number of neuron states, Q. For small initial correlations between the neurons and
the patterns it leads to optimal retrieval performance. For binary neurons, Q = 2,
and biased patterns we recover the Hopfield model. For three-state neurons, Q = 3,
we recover the recently introduced Blume-Emery-Griffiths network hamiltonian. We
derive its phase diagram and compare it with those of related three-state models. We
find that its retrieval region is the largest.

One of the challenging problems in the statistical mechanics approach to associative
memory neural networks is the choice of the hamiltonian and/or learning rule leading
to the best retrieval properties including, e.g., the largest retrieval overlap, loading
capacity, basin of attraction, convergence time. Recently, it has been shown that the
mutual information is the most appropriate concept to measure the retrieval quality,
especially for sparsely coded networks but also in general ([1], [2] and references therein).

A natural question is then whether one could use the mutual information in a 
systematic way to determine a priori an optimal hamiltonian guaranteeing the properties 
described above for an arbitrary scalar valued neuron (spin) model. Optimal means 
especially that although the network might start initially far from the embedded pattern 
it is still able to retrieve it. 

In the following we answer this question by presenting a general scheme to
express the mutual information as a function of the relevant macroscopic parameters
(e.g., the overlap with the embedded patterns and the activity) and to construct a
hamiltonian from it for general Q-state neural networks. For Q = 2, we recover the Hopfield model
for biased patterns || ensuring that this hamiltonian is optimal in the sense described 
above. For Q = 3, we obtain a Blume-Emery-Griffiths type hamiltonian confirming
the result found in [4]. However, in that paper the properties of this hamiltonian have
not been discussed, rather the dynamics for an extremely diluted version of the model 
has been treated. Hence, we derive the thermodynamic phase diagram for the fully 
connected network modeled by this hamiltonian and show, e.g., that it has the largest 
retrieval region compared with the other three-state models known in the literature. 

Consider a network of N neurons $\Sigma_i$, $i = 1, \ldots, N$, taking values $\sigma_i$ from a
discrete set of Q states, $\mathcal{S}$, with a certain probability distribution. In this network we
want to store $p = \alpha N$ patterns $\Xi_i^\mu$, $\mu = 1, \ldots, p$, taking values $\xi_i^\mu$ out of the
same set $\mathcal{S}$ with a certain probability distribution. Both sets of random variables are
chosen to be independent identically distributed with respect to $i$.

We want to study the mutual information between the neurons and the patterns,
a measure of the correlations between them. At this point we note that, since the
interactions are of infinite range, the neural network system is mean-field such that the
probability distributions of all the neurons and all the patterns are of product type,
e.g., $p(\{\sigma_i\}) = \prod_i p(\sigma_i)$. Furthermore, in a statistical mechanical treatment any order
parameter $M$, being a function of the neurons and the patterns, can be written in the
thermodynamic limit $N \to \infty$ as

$$\frac{1}{N}\sum_{i=1}^{N} M(\sigma_i, \xi_i) = \sum_{\sigma,\xi} p_{\Sigma\Xi}(\sigma,\xi)\, M(\sigma,\xi) \tag{1}$$

where the left hand side is the configurational average, $\xi = \{\xi^\mu\}$, and where $p_{\Sigma\Xi}(\sigma,\xi)$ is
the joint probability distribution of the neurons and the patterns. Hence, we can forget
about the index $i$ in the sequel.

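The mutual information $I$ for the random variables $\Sigma$ and $\Xi$ is then given by (see,
e.g., [5])

$$I(\Sigma,\Xi) = \sum_{\sigma,\xi} p_{\Sigma\Xi}(\sigma,\xi)\,\ln\frac{p_{\Sigma\Xi}(\sigma,\xi)}{p_\Sigma(\sigma)\,p_\Xi(\xi)}. \tag{2}$$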

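A good network would be one that starts initially far from a pattern but is still
able to retrieve it. Far in this context means that the random variables, neurons and
patterns, are almost independent. When $\Sigma$ and $\Xi$ are completely independent, then
$p_{\Sigma\Xi} = p_\Sigma\, p_\Xi$. Consequently, when they are almost independent we can write

$$p_{\Sigma\Xi} = p_\Sigma\, p_\Xi + \Delta_{\Sigma\Xi} \tag{3}$$

with $\Delta_{\Sigma\Xi}$ small pointwise. We remark that

$$\sum_{\sigma,\xi} \Delta_{\Sigma\Xi}(\sigma,\xi) = 0. \tag{4}$$

Plugging the relation (3) into the definition (2) and expanding the logarithm up to
second order in the small correlations, $\Delta$, we find using (4)

$$I = \frac{1}{2}\sum_{\sigma,\xi} \frac{\Delta_{\Sigma\Xi}^2(\sigma,\xi)}{p_\Sigma(\sigma)\,p_\Xi(\xi)} + O(\Delta^3) = \frac{1}{2}\left\langle\left\langle \left(\frac{\Delta_{\Sigma\Xi}(\sigma,\xi)}{p_\Sigma(\sigma)\,p_\Xi(\xi)}\right)^{2} \right\rangle_{\sigma}\right\rangle_{\xi} + O(\Delta^3) \tag{5}$$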

with obvious notation. This approximation is in fact very natural. It is the average 
over the square of the difference between the correlated and uncorrelated probability 
distribution. 
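
As an illustration of this expansion, the following minimal Python sketch compares the exact mutual information with its second-order approximation for a weakly correlated pair of three-state variables. The marginals, the correlation pattern `u` and the size `epsilon` are hypothetical values chosen only for this example; they do not come from the derivation above.

```python
import numpy as np

def exact_mutual_information(p_joint):
    """Mutual information (natural logarithm) of a joint distribution p(sigma, xi)."""
    p_sigma = p_joint.sum(axis=1, keepdims=True)  # marginal of the neurons
    p_xi = p_joint.sum(axis=0, keepdims=True)     # marginal of the patterns
    product = p_sigma @ p_xi                      # uncorrelated distribution
    mask = p_joint > 0                            # terms with p = 0 contribute nothing
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / product[mask])))

def second_order_mutual_information(p_joint):
    """Second-order approximation: (1/2) * sum Delta^2 / (p_sigma * p_xi)."""
    p_sigma = p_joint.sum(axis=1, keepdims=True)
    p_xi = p_joint.sum(axis=0, keepdims=True)
    product = p_sigma @ p_xi
    delta = p_joint - product                     # small correlation Delta
    return float(0.5 * np.sum(delta**2 / product))

# Hypothetical example: independent marginals plus a small correlation.
p_sigma = np.array([0.3, 0.4, 0.3])
p_xi = np.array([0.25, 0.5, 0.25])
u = np.array([1.0, 0.0, -1.0])   # sums to zero, so the marginals stay unchanged
epsilon = 0.01                   # size of the correlation
p_joint = np.outer(p_sigma, p_xi) + epsilon * np.outer(u, u)

i_exact = exact_mutual_information(p_joint)
i_approx = second_order_mutual_information(p_joint)
print(i_exact, i_approx)
```

For correlations of this size the two numbers should differ only by higher-order terms in `epsilon`.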

We remark that still all the patterns are contained in (5). Without loss of
generality we consider only one condensed pattern and omit the index $\mu$ in the sequel.
Consequently, only first and second order correlations of the variables will be used;
higher order ones can be neglected.

Next, we want to express $\Delta_{\Sigma\Xi}$ in terms of macroscopic, physical quantities of the
system (order parameters). Referring to (1) we write down the following $Q^2$ moments

$$m^{cd} = \frac{1}{N}\sum_{i} \sigma_i^{c}\, \xi_i^{d} = \left\langle \sigma^{c}\xi^{d}\right\rangle_{\sigma,\xi} = \sum_{\sigma,\xi} p_{\Sigma\Xi}(\sigma,\xi)\,\sigma^{c}\xi^{d} \tag{6}$$

with $c, d = 0, \ldots, Q-1$ and using the notation $0^0 = 1$. We remark that $m^{00} = 1$ such
that we have in general $Q^2 - 1$ independent parameters specifying $p_{\Sigma\Xi}(\sigma,\xi)$.

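Up to now the derivation is valid for general Q-state scalar-valued neurons. To fix
the ideas we choose the neuron states as

$$\sigma_c = -1 + \frac{2(c-1)}{Q-1} \quad\text{with}\quad c = 1, \ldots, Q. \tag{7}$$

This choice corresponds to a Q-state Ising-type architecture leading to

$$m^{cd} = \sum_{x,y=1}^{Q} T^{cd}_{xy}\, p_{\Sigma\Xi}(\sigma_x,\xi_y), \quad\text{with}\quad T^{cd}_{xy} = \left(-1+\frac{2(x-1)}{Q-1}\right)^{c}\left(-1+\frac{2(y-1)}{Q-1}\right)^{d}. \tag{8}$$

In a similar way we introduce $A$ by

$$m^{c0} = \sum_{x=1}^{Q} A^{c}_{x}\, p_\Sigma(\sigma_x), \qquad m^{0c} = \sum_{x=1}^{Q} A^{c}_{x}\, p_\Xi(\xi_x), \qquad A^{c}_{x} = \left(-1+\frac{2(x-1)}{Q-1}\right)^{c}. \tag{9}$$
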
Finally, by introducing the inverse transformations $S$ and $B$

$$\sum_{x,y} T^{cd}_{xy}\, S^{xy}_{c'd'} = \delta_{c,c'}\,\delta_{d,d'}, \qquad \sum_{x} A^{c}_{x}\, B^{x}_{d} = \delta_{c,d}, \tag{10}$$

we can write the approximation of the mutual information up to order $\Delta^2$ as

$$I = \frac{1}{2}\sum_{x,y} \frac{\left(\sum_{c,d} S^{xy}_{cd}\, m^{cd} - \sum_{c,d} B^{x}_{c} B^{y}_{d}\, m^{c0} m^{0d}\right)^{2}}{\sum_{c,d} B^{x}_{c} B^{y}_{d}\, m^{c0} m^{0d}} \tag{11}$$

where we have left out the dependence on $\Sigma$ and $\Xi$.
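
The change of variables between the joint distribution and the moments can be sketched numerically. The Python example below is a minimal illustration, assuming that the transformation factorizes as $T^{cd}_{xy} = A^{c}_{x} A^{d}_{y}$, so that $S$ can be built from $B = A^{-1}$; the function names and the test distribution are hypothetical and serve only this example.

```python
import numpy as np

def neuron_states(Q):
    """States sigma_c = -1 + 2(c-1)/(Q-1) for c = 1..Q."""
    return -1.0 + 2.0 * np.arange(Q) / (Q - 1)

def moment_transform(Q):
    """Matrix A[c, x] = sigma_x**c for c = 0..Q-1 (with 0**0 = 1)."""
    return np.vander(neuron_states(Q), N=Q, increasing=True).T

def moments_from_joint(p_joint):
    """m[c, d] = sum_{x,y} A[c, x] A[d, y] p(sigma_x, xi_y)."""
    A = moment_transform(p_joint.shape[0])
    return A @ p_joint @ A.T

def joint_from_moments(m):
    """Invert the transformation with B = A^{-1}."""
    B = np.linalg.inv(moment_transform(m.shape[0]))
    return B @ m @ B.T

def mutual_information_from_moments(m):
    """Second-order mutual information written in terms of the moments only."""
    B = np.linalg.inv(moment_transform(m.shape[0]))
    p_joint = B @ m @ B.T
    p_sigma = B @ m[:, 0]   # marginal of the neurons from the moments m^{c0}
    p_xi = B @ m[0, :]      # marginal of the patterns from the moments m^{0d}
    product = np.outer(p_sigma, p_xi)
    return float(0.5 * np.sum((p_joint - product) ** 2 / product))

# Hypothetical weakly correlated Q = 3 distribution.
p_joint = (np.outer([0.3, 0.4, 0.3], [0.25, 0.5, 0.25])
           + 0.01 * np.outer([1.0, 0.0, -1.0], [1.0, 0.0, -1.0]))
m = moments_from_joint(p_joint)
p_recovered = joint_from_moments(m)
i_moments = mutual_information_from_moments(m)
print(m[0, 0], i_moments)
```

The moment $m^{00}$ comes out as 1 by normalization, and the distribution recovered from the moments should agree with the one the sketch started from.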

Using (1), the expression (11) for $I$ can be written in terms of configurational
averages of the system. In this way, for large N, we can express $I$ as a function of
the microscopic variables $\sigma_i$ and $\xi_i^\mu$. Using (11) for every pattern $\mu$, summing over
$\mu$ and multiplying by N we get an extensive quantity, denoted by $I_N$, which grows
monotonically as a function of the correlation between spins and patterns. Therefore,
$H = -I_N$ is a good candidate for a hamiltonian.

Configurational averages also enter into the denominator of (11). Since we are
mainly interested in the correlations between spins and patterns, rather than in the
respective single probability distributions, we assume that the latter are equal, so that
we can use the known distribution of the patterns in the denominator.

What we have presented up to now is a scheme to calculate a hamiltonian for a 
general Q-state network with an Ising-type architecture using mutual information. 

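Next, we discuss some specific examples, Q = 2 and Q = 3, in detail. We start
with Q = 2 states. Given the probabilities associated with each state, the inversion of
the transformations $T$ (8) and $A$ (9) leads to

$$\begin{aligned} p_{\Sigma\Xi}(\sigma,\xi) ={}& \frac{1 - m^{10} - m^{01} + m^{11}}{4}\,\delta_{\sigma,-1}\delta_{\xi,-1} + \frac{1 - m^{10} + m^{01} - m^{11}}{4}\,\delta_{\sigma,-1}\delta_{\xi,1} \\ &+ \frac{1 + m^{10} - m^{01} - m^{11}}{4}\,\delta_{\sigma,1}\delta_{\xi,-1} + \frac{1 + m^{10} + m^{01} + m^{11}}{4}\,\delta_{\sigma,1}\delta_{\xi,1}. \end{aligned} \tag{12}$$

The distributions $p_\Sigma(\sigma)$ and $p_\Xi(\xi)$ can be found by summing out $\xi$ and $\sigma$ respectively.
Using these distributions, $I$ becomes

$$I = \frac{\left(m^{11} - m^{01} m^{10}\right)^{2}}{2\left(1-(m^{01})^{2}\right)\left(1-(m^{10})^{2}\right)}. \tag{13}$$

Substituting the averages over the probability distributions by configurational averages
and putting $m^{10} = m^{01} = b$ in the denominator, where $b$ is the bias of the patterns, we
get

$$H = -\frac{N}{2(1-b^{2})^{2}} \sum_{\mu} \left(\frac{1}{N}\sum_{i} \left(\xi_i^\mu \sigma_i - b\,\sigma_i\right)\right)^{2}. \tag{14}$$

This hamiltonian can be written as

$$H = -\frac{1}{2}\sum_{i,j} J_{ij}\,\sigma_i\sigma_j \quad\text{with}\quad J_{ij} = \frac{1}{N(1-b^{2})^{2}}\sum_{\mu} (\xi_i^\mu - b)(\xi_j^\mu - b) \tag{15}$$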

and this is precisely the Hopfield hamiltonian with || and without bias. 
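
The equivalence between the overlap form and the coupling form of this hamiltonian can be checked numerically. The sketch below assumes the prefactor $1/(N(1-b^2)^2)$ in the couplings, which is what follows from inserting configurational averages into the two-state mutual information with $m^{10} = m^{01} = b$ in the denominator. The network size, number of patterns, bias and random seed are hypothetical, and the diagonal terms $i = j$ are kept in the sum for simplicity.

```python
import numpy as np

def hopfield_couplings(xi, b):
    """J_ij = sum_mu (xi_i^mu - b)(xi_j^mu - b) / (N (1 - b^2)^2) for patterns xi of shape (p, N)."""
    n_neurons = xi.shape[1]
    centered = xi - b
    return (centered.T @ centered) / (n_neurons * (1.0 - b**2) ** 2)

def energy_from_couplings(sigma, J):
    """H = -(1/2) sum_{i,j} J_ij sigma_i sigma_j (diagonal terms included)."""
    return float(-0.5 * sigma @ J @ sigma)

def energy_from_overlaps(sigma, xi, b):
    """H = -N/(2 (1 - b^2)^2) * sum_mu [ (1/N) sum_i sigma_i (xi_i^mu - b) ]^2."""
    n_neurons = sigma.size
    overlaps = (xi - b) @ sigma / n_neurons
    return float(-n_neurons * np.sum(overlaps**2) / (2.0 * (1.0 - b**2) ** 2))

# Hypothetical sizes and bias, chosen only for illustration.
rng = np.random.default_rng(0)
n_neurons, n_patterns, b = 200, 5, 0.2
# Biased +-1 patterns with mean b: P(xi = +1) = (1 + b) / 2.
xi = np.where(rng.random((n_patterns, n_neurons)) < (1 + b) / 2, 1.0, -1.0)
sigma = np.where(rng.random(n_neurons) < 0.5, 1.0, -1.0)

J = hopfield_couplings(xi, b)
e_couplings = energy_from_couplings(sigma, J)
e_overlaps = energy_from_overlaps(sigma, xi, b)
print(e_couplings, e_overlaps)
```

Both evaluations should give the same energy up to rounding, since the coupling form is just the expanded square of the overlap form.
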
We remark that a particularly nice aspect of this treatment is that the adjustment
of the learning rule due to the bias enters in a natural way. Furthermore, we learn 
that the Hopfield hamiltonian is the optimal two-state hamiltonian in the sense that 
we started from the mutual information calculated for an initial state having a small 
overlap with the condensed pattern. This confirms a well-known fact in the literature. 

One could ask what happens when one assumes that initially the state of the 
network is already close to the embedded pattern. Since the mutual information for 
fully correlated random variables is equal to the entropy, S(E), || one is interested in 
(assuming again one condensed pattern) $F(\Sigma,\Xi) = I(\Sigma,\Xi) - S(\Xi)$. We define

fefotf = E b&C^O <W%>* +p£(°'.0(i - <W<%,e)] (16) 

with obvious notation. Writing 

?4 = p E +A' E3 (17) 

with A' EH small pointwise for large correlations and assuming that p^L(<J, £) = for Va, £, 
in order to retain only the polynomial behaviour, we expand F and find 

^^)=^E (A yri ))2 + °( A/3 )- ^ 

Expressing F in terms of the order parameters as in (|11]) , we get the hamiltonian || [J 

1 V (g ~ b) 

for one pattern. In || it is shown that this hamiltonian can store an infinite number of 
patterns. This is consistent with the intuitive idea that it is possible to store a lot of 
patterns as long as the network state is initially close to them. 

For Q = 3 we focus, without loss of generality, on the case where the distributions
are taken symmetric around zero, meaning that all the odd moments vanish. Following
the scheme proposed above we arrive at

$$I = \frac{1}{2}\frac{(m^{11})^{2}}{m^{02} m^{20}} + \frac{1}{2}\frac{\left(m^{22} - m^{02} m^{20}\right)^{2}}{m^{02} m^{20}\,(1-m^{02})(1-m^{20})}. \tag{20}$$

Identifying m 02 = a as the activity of the patterns, m 20 = q as the activity of the 
neurons, m 11 = m as the overlap, m 22 = n as the activity overlap [0, and defining 
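$l = n - aq$, and using the pattern activity $a$ for the neuron activity $q$ in the denominators as
before, we arrive at

$$I = \frac{1}{2}\frac{m^{2}}{a^{2}} + \frac{1}{2}\frac{l^{2}}{\left(a(1-a)\right)^{2}}. \tag{21}$$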
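
The three-state expression can be checked against the general second-order formula. The following sketch assumes the closed form $I = m^2/(2aq) + l^2/(2aq(1-a)(1-q))$ with $l = n - aq$ and builds the symmetric joint distribution explicitly from the four parameters; the numerical values of `a`, `q`, `m` and `n` are hypothetical.

```python
import numpy as np

def symmetric_joint_q3(a, q, m, n):
    """Joint distribution of sigma, xi in (-1, 0, 1) with vanishing odd moments."""
    values = (-1, 0, 1)
    p = np.empty((3, 3))
    for i, s in enumerate(values):
        for j, x in enumerate(values):
            if s != 0 and x != 0:
                p[i, j] = (n + s * x * m) / 4   # both neuron and pattern active
            elif s != 0:
                p[i, j] = (q - n) / 2           # active neuron, inactive pattern
            elif x != 0:
                p[i, j] = (a - n) / 2           # inactive neuron, active pattern
            else:
                p[i, j] = 1 - q - a + n         # both inactive
    return p

def second_order_information(p_joint):
    """(1/2) * sum Delta^2 / (p_sigma * p_xi), with Delta = p_joint - p_sigma * p_xi."""
    product = np.outer(p_joint.sum(axis=1), p_joint.sum(axis=0))
    return float(0.5 * np.sum((p_joint - product) ** 2 / product))

def closed_form_information(a, q, m, n):
    """Closed form in terms of overlap m and l = n - a q (assumed form, see lead-in)."""
    l = n - a * q
    return m**2 / (2 * a * q) + l**2 / (2 * a * q * (1 - a) * (1 - q))

# Hypothetical parameter values close to the uncorrelated point.
a, q, m = 2.0 / 3.0, 0.6, 0.05
n = a * q + 0.02
p_joint = symmetric_joint_q3(a, q, m, n)
i_direct = second_order_information(p_joint)
i_closed = closed_form_information(a, q, m, n)
print(i_direct, i_closed)
```

Setting `m = 0` while keeping `n` different from `a * q` still gives a nonzero value, which illustrates how the activity overlap alone can carry information.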

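Inserting configurational averages into (21), as was done for Q = 2, leads to a hamiltonian

$$H = -\frac{1}{2}\sum_{i,j} J_{ij}\,\sigma_i\sigma_j - \frac{1}{2}\sum_{i,j} K_{ij}\,\sigma_i^{2}\sigma_j^{2} \tag{22}$$

with

$$J_{ij} = \frac{1}{a^{2}N}\sum_{\mu} \xi_i^\mu\xi_j^\mu, \qquad K_{ij} = \frac{1}{a^{2}(1-a)^{2}N}\sum_{\mu} \eta_i^\mu\eta_j^\mu, \qquad \eta_i^\mu = (\xi_i^\mu)^{2} - a. \tag{23}$$
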
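
A numerical sketch may help to see how the three-state hamiltonian follows from the mutual information. It assumes that the hamiltonian is $-N\sum_\mu [m_\mu^2/(2a^2) + l_\mu^2/(2(a(1-a))^2)]$ with $m_\mu = \frac{1}{N}\sum_i \xi_i^\mu\sigma_i$ and $l_\mu = \frac{1}{N}\sum_i ((\xi_i^\mu)^2 - a)\sigma_i^2$, and that it can be rewritten with a bilinear coupling `J` and a biquadratic coupling `K`. The function names, sizes and seed are hypothetical, and diagonal terms are kept in the sums.

```python
import numpy as np

def beg_couplings(xi, a):
    """Bilinear couplings J and biquadratic couplings K for patterns xi of shape (p, N)."""
    n_neurons = xi.shape[1]
    eta = xi**2 - a                                  # centered pattern activity
    J = (xi.T @ xi) / (a**2 * n_neurons)
    K = (eta.T @ eta) / ((a * (1.0 - a)) ** 2 * n_neurons)
    return J, K

def beg_energy(sigma, J, K):
    """H = -(1/2) sum J_ij s_i s_j - (1/2) sum K_ij s_i^2 s_j^2 (diagonal terms included)."""
    s2 = sigma**2
    return float(-0.5 * sigma @ J @ sigma - 0.5 * s2 @ K @ s2)

def beg_energy_from_overlaps(sigma, xi, a):
    """H = -N sum_mu [ m_mu^2 / (2 a^2) + l_mu^2 / (2 (a (1 - a))^2) ]."""
    n_neurons = sigma.size
    m = xi @ sigma / n_neurons                       # overlaps
    l = (xi**2 - a) @ sigma**2 / n_neurons           # activity overlaps minus a * q
    return float(-n_neurons * np.sum(m**2 / (2 * a**2) + l**2 / (2 * (a * (1 - a)) ** 2)))

# Hypothetical sizes; uniform three-state patterns correspond to a = 2/3.
rng = np.random.default_rng(1)
n_neurons, n_patterns, a = 150, 4, 2.0 / 3.0
xi = rng.choice([-1.0, 0.0, 1.0], size=(n_patterns, n_neurons), p=[a / 2, 1 - a, a / 2])
sigma = rng.choice([-1.0, 0.0, 1.0], size=n_neurons)

J, K = beg_couplings(xi, a)
e_couplings = beg_energy(sigma, J, K)
e_overlaps = beg_energy_from_overlaps(sigma, xi, a)
print(e_couplings, e_overlaps)
```

The two energies should coincide up to rounding, the second term being the one that has no counterpart in the two-state case.
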
This hamiltonian resembles the Blume-Emery-Griffiths (BEG) hamiltonian [10]. The
derivation above confirms the result found in [4] starting from an explicit form of the
mutual information for Q = 3. In that paper the dynamics has been studied for an
extremely diluted asymmetric version of this model. Here we want to discuss the fully
connected architecture and derive the thermodynamic phase diagram, which has not
been done in the literature, in order to compare it with the other known Q = 3 state
models.



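In order to calculate the free energy we use the standard replica method [11].
Starting from the replicated partition function and assuming replica symmetry we obtain

$$\begin{aligned} f ={}& \frac{1}{2}\sum_{\nu}\left(m_\nu^{2} + l_\nu^{2}\right) + \frac{\alpha}{2\beta}\log(1-\chi) + \frac{\alpha}{2\beta}\log(1-\phi) + \frac{\alpha}{2\beta}\frac{\chi}{1-\chi} + \frac{\alpha}{2\beta}\frac{\phi}{1-\phi} \\ &+ \frac{\alpha}{2}\frac{A q_1 \chi}{(1-\chi)^{2}} + \frac{\alpha}{2}\frac{B p_1 \phi}{(1-\phi)^{2}} - \frac{1}{\beta}\left\langle\!\!\left\langle \int Ds\,Dt\, \log \mathrm{Tr}_\sigma \exp\left(\beta \tilde{H}\right)\right\rangle\!\!\right\rangle \end{aligned} \tag{24}$$

with $\nu$ denoting the condensed patterns and

$$\tilde{H} = A\sigma\left[\sum_{\nu} m_\nu \xi^\nu + \sqrt{\alpha r}\, s\right] + B\sigma^{2}\left[\sum_{\nu} l_\nu \eta^\nu + \sqrt{\alpha u}\, t\right] + \frac{\sigma^{2}\alpha A \chi}{2(1-\chi)} + \frac{\sigma^{4}\alpha B \phi}{2(1-\phi)} \tag{25}$$

and $A = 1/a$, $B = 1/(a(1-a))$, $Ds$ and $Dt$ Gaussian measures, $Ds = ds\,(2\pi)^{-1/2}\exp(-s^{2}/2)$, and

$$\chi = A\beta(q_0 - q_1), \qquad \phi = B\beta(p_0 - p_1), \qquad r = \frac{q_1}{(1-\chi)^{2}}, \qquad u = \frac{p_1}{(1-\phi)^{2}}. \tag{26}$$

For Q = 3 the order parameters are defined as follows

$$\begin{aligned} m_\nu &= A\left\langle\!\!\left\langle \xi^\nu \int Ds\,Dt\, \langle\sigma\rangle_\beta \right\rangle\!\!\right\rangle, & l_\nu &= B\left\langle\!\!\left\langle \eta^\nu \int Ds\,Dt\, \langle\sigma^{2}\rangle_\beta \right\rangle\!\!\right\rangle, \\ q_0 &= p_0 = \left\langle\!\!\left\langle \int Ds\,Dt\, \langle\sigma^{2}\rangle_\beta \right\rangle\!\!\right\rangle, & q_1 &= \left\langle\!\!\left\langle \int Ds\,Dt\, \langle\sigma\rangle_\beta^{2} \right\rangle\!\!\right\rangle, \qquad p_1 = \left\langle\!\!\left\langle \int Ds\,Dt\, \langle\sigma^{2}\rangle_\beta^{2} \right\rangle\!\!\right\rangle \end{aligned} \tag{27}$$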



with the small brackets $\langle\ldots\rangle_\beta$ denoting the usual thermal average. We recall that $m_\nu$ is
the overlap, $l_\nu$ is related to the activity overlap, $q_0$ is the activity of the neurons and $q_1$
and $p_1$ are Edwards-Anderson parameters. For one condensed pattern the index $\nu$ can
be dropped.
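
For three-state neurons the thermal averages entering the order parameters can be written in closed form. The sketch below assumes an effective single-site hamiltonian of the form $h\sigma + \theta\sigma^{2}$ with $\sigma \in \{-1, 0, 1\}$, where `h` collects all terms linear in $\sigma$ and `theta` all terms in $\sigma^{2}$ (for Q = 3 one has $\sigma^{4} = \sigma^{2}$). The names `h` and `theta` and the numerical values are hypothetical; the Gaussian integrations and pattern averages are not included.

```python
import numpy as np

def thermal_averages(h, theta, beta):
    """Closed-form <sigma> and <sigma^2> for weight exp(beta * (h*sigma + theta*sigma**2))."""
    w = 2.0 * np.exp(beta * theta)
    z = 1.0 + w * np.cosh(beta * h)          # single-site partition function
    return w * np.sinh(beta * h) / z, w * np.cosh(beta * h) / z

def thermal_averages_brute_force(h, theta, beta):
    """Same averages obtained by summing explicitly over the three states."""
    sigma = np.array([-1.0, 0.0, 1.0])
    weights = np.exp(beta * (h * sigma + theta * sigma**2))
    z = weights.sum()
    return (sigma * weights).sum() / z, (sigma**2 * weights).sum() / z

# Hypothetical field values and inverse temperature.
h, theta, beta = 0.7, -0.3, 2.0
avg_closed = thermal_averages(h, theta, beta)
avg_brute = thermal_averages_brute_force(h, theta, beta)
print(avg_closed, avg_brute)
```

In a full solution of the fixed-point equations these averages would be evaluated inside the Gaussian integrals over `s` and `t` and averaged over the condensed patterns.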

Solving the fixed-point equations for the order parameters and considering uniform
patterns (a = 2/3), we obtain a rich $T$-$\alpha$ phase diagram (see [7] for more details).
The phases that are important from a neural network point of view are presented in
figure 1. The border of the retrieval phase (m > 0, l > 0) is denoted by a thick full
line. The most important result is that the capacity of the BEG neural network is
much larger than that of other Q = 3 models. Compared with the Q = 3 Ising model
[12], e.g., it is almost twice as large at T = 0. Of course this is due to the second term in the
hamiltonian (22). A study of the dynamics of this model, which is in progress, confirms
this result. Another new feature in the phase diagram, compared with other models,
is the so-called quadrupolar phase (m = 0 but l > 0) which lies below the thin full
line. It is present in the original BEG spin-model [10] and has also been seen for the
extremely diluted network model [4]. In this phase the active neurons (±1) coincide
with the active patterns but the sign does not. This means that although the system
does not succeed in retrieval the information content is nonzero. For a = 2/3 this phase
lies completely within the retrieval phase but for other values of a (e.g., a = 0.8) it does
not. Besides these phases one also has a spin-glass phase and a paramagnetic phase
(separated by the broken line in figure 1). The latter coexists with the retrieval phase
in a region near the T-axis. We refer to [7] for further details.

Figure 1. Q = 3 $T$-$\alpha$ phase diagram for uniform patterns. The meaning of the lines
is explained in the text.



In conclusion, we have presented a method starting from the mutual information between
the neurons and the patterns to derive an optimal hamiltonian for a general Q-state
neural network. The derivation assumes that the correlations between the neurons and
patterns are small initially, and thus guarantees optimal retrieval properties (loading
capacity, basin of attraction) for the model. For Q = 2, we recover the Hopfield
hamiltonian for biased patterns, while for Q = 3 we find the Blume-Emery-Griffiths
hamiltonian. We have derived the phase diagram for this fully connected BEG model,
confirming that the capacity is larger than the one for related models. We believe that
similar results can be obtained for vector models and other architectures. An extended
version of the work on the BEG fully connected neural network will appear in [7].